Abstract

There has been a lot of recent interest in designing neural network
models to estimate a distribution from a set of examples. We introduce
a simple modification for autoencoder neural networks that yields
powerful generative models. Our method masks the autoencoder’s
parameters to respect autoregressive constraints: each input is
reconstructed only from previous inputs in a given ordering.
Constrained this way, the autoencoder outputs can be interpreted as a
set of conditional probabilities, and their product, the full joint
probability. We can also train a single network that can decompose the
joint probability in multiple different orderings. Our simple framework
can be applied to multiple architectures, including deep ones.
Vectorized implementations, such as on GPUs, are simple and fast.
Experiments demonstrate that this approach is competitive with
state-of-the-art tractable distribution estimators. At test time, the
method is significantly faster and scales better than other
autoregressive estimators.

1. Introduction 

Distribution estimation is the task of estimating a joint distribution
p(x) from a set of examples $\{x^{(t)}\}_{t=1}^{T}$, which is by
definition a general problem. Many tasks in machine learning can be
formulated as learning only specific properties of a joint
distribution. Thus a good distribution estimator can be used in many
scenarios, including classification (Schmah et al., 2009), denoising
or missing input imputation (Poon & Domingos, 2011; Dinh et al., 2014),
data (e.g. speech) synthesis (Uria et al., 2015) and many others. The
very nature of distribution estimation also makes it a particular
challenge for machine learning. In essence, the curse of dimensionality
has a distinct impact because, as the number of dimensions of the input
space of x grows, the volume of space in which the model must provide a
good answer for p(x) exponentially increases.

Fortunately, recent research has made substantial progress on this
task. Specifically, learning algorithms for a variety of neural network
models have been proposed (Bengio & Bengio, 2000; Larochelle & Murray,
2011; Gregor & LeCun, 2011; Uria et al., 2013; 2014; Kingma & Welling,
2014; Rezende et al., 2014; Bengio et al., 2014; Gregor et al., 2014;
Goodfellow et al., 2014; Dinh et al., 2014). These algorithms are
showing great potential in scaling to high-dimensional distribution
estimation problems. In this work, we focus our attention on
autoregressive models (Section 3). Computing p(x) exactly for a test
example x is tractable with these models. However, the computational
cost of this operation is still larger than typical neural network
predictions for a D-dimensional input. For previous deep autoregressive
models, evaluating p(x) costs O(D) times more than a simple neural
network point predictor.

This paper’s contribution is to describe and explore a simple way of
adapting autoencoder neural networks that makes them competitive
tractable distribution estimators that are faster than existing
alternatives. We show how to mask the weighted connections of a
standard autoencoder to convert it into a distribution estimator. The
key is to use masks that are designed in such a way that the output is
autoregressive for a given ordering of the inputs, i.e. that each input
dimension is reconstructed solely from the dimensions preceding it in
the ordering. The resulting Masked Autoencoder Distribution Estimator
(MADE) preserves the efficiency of a single pass through a regular
autoencoder. Implementation on a GPU is straightforward, making the
method scalable.

MADE: Masked Autoencoder for Distribution Estimation

The single hidden layer version of MADE corresponds to the previously
proposed autoregressive neural network of Bengio & Bengio (2000). Here,
we go further by exploring deep variants of the model. We also explore
training MADE to work simultaneously with multiple orderings of the
input observations and hidden layer connectivity structures. We test
these extensions across a range of binary datasets with hundreds of
dimensions, and compare MADE’s statistical performance and scaling to
those of comparable methods.

2. Autoencoders 

A brief description of the basic autoencoder, on which this work
builds, is required to clearly grasp what follows. In this paper, we
assume that we are given a training set of examples
$\{x^{(t)}\}_{t=1}^{T}$. We concentrate on the case of binary
observations, where for every D-dimensional input x, each input
dimension $x_d$ belongs to {0, 1}. The motivation is to learn hidden
representations of the inputs that reveal the statistical structure of
the distribution that generated them.

An autoencoder attempts to learn a feed-forward, hidden representation
h(x) of its input x such that, from it, we can obtain a reconstruction
$\hat{x}$ which is as close as possible to x. Specifically, we have

$$h(x) = g(b + Wx) \tag{1}$$

$$\hat{x} = \mathrm{sigm}(c + V h(x)), \tag{2}$$

where W and V are matrices, b and c are vectors, g is a nonlinear
activation function and $\mathrm{sigm}(a) = 1/(1 + \exp(-a))$. Thus, W
represents the connections from the input to the hidden layer, and V
represents the connections from the hidden to the output layer.
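As an illustration, Equations 1 and 2 can be sketched in plain Python as follows. The ReLU nonlinearity, the layer sizes and the random weights are assumptions chosen for the example and are not taken from the experiments.

```python
import math
import random

def sigm(a):
    """Logistic sigmoid, sigm(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def relu(a):
    """One possible choice for the nonlinearity g (an assumption here)."""
    return max(0.0, a)

def affine(bias, weights, vec):
    """Return bias + weights @ vec for a list-of-rows matrix."""
    return [b + sum(w * v for w, v in zip(row, vec))
            for b, row in zip(bias, weights)]

def autoencoder_forward(x, W, b, V, c, g=relu):
    """Equations 1 and 2: h = g(b + W x), x_hat = sigm(c + V h)."""
    h = [g(a) for a in affine(b, W, x)]
    x_hat = [sigm(a) for a in affine(c, V, h)]
    return h, x_hat

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
D, K = 4, 3
W = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
V = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(D)]
b, c = [0.0] * K, [0.0] * D
h, x_hat = autoencoder_forward([1, 0, 1, 1], W, b, V, c)
```

Each entry of `x_hat` lies between 0 and 1, so it can be read as a per-dimension probability, which is the interpretation used in the loss that follows.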

To train the autoencoder, we must first specify a training loss
function. For binary observations, a natural choice is the
cross-entropy loss:

$$\ell(x) = \sum_{d=1}^{D} -x_d \log \hat{x}_d - (1 - x_d) \log(1 - \hat{x}_d). \tag{3}$$

By treating $\hat{x}_d$ as the model’s probability that $x_d$ is 1, the
cross-entropy can be understood as taking the form of a negative
log-likelihood function. Training the autoencoder corresponds to
optimizing the parameters {W, V, b, c} to reduce the average loss on
the training examples, usually with (mini-batch) stochastic gradient
descent.
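The cross-entropy of Equation 3 can be written as a short function. This is a minimal sketch: the clipping constant `eps` is an addition for numerical safety and is not part of Equation 3, and the example values are made up.

```python
import math

def cross_entropy(x, x_hat, eps=1e-12):
    """Equation 3: sum_d -x_d log x_hat_d - (1 - x_d) log(1 - x_hat_d).

    eps clips x_hat away from 0 and 1; this guard is an assumption added
    for numerical safety and does not appear in Equation 3.
    """
    total = 0.0
    for xd, xh in zip(x, x_hat):
        xh = min(max(xh, eps), 1.0 - eps)
        total += -xd * math.log(xh) - (1 - xd) * math.log(1.0 - xh)
    return total

# Hypothetical target and reconstruction.
loss = cross_entropy([1, 0, 1], [0.9, 0.2, 0.5])
```

For this example the loss equals -log(0.9 * 0.8 * 0.5), the negative log of the probability that the three outputs jointly assign to the observed bits.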

One advantage of the autoencoder paradigm is its flexibility. In
particular, it is straightforward to obtain a deep autoencoder by
inserting more hidden layers between the input and output layers. Its
main disadvantage is that the representation it learns can be trivial.
For instance, if the hidden layer is at least as large as the input,
hidden units can each learn to “copy” a single input dimension, so as
to reconstruct all inputs perfectly at the output layer. One obvious
consequence of this observation is that the loss function of
Equation 3 isn’t in fact a proper log-likelihood function. Indeed,
since perfect reconstruction could be achieved, the implied data
‘distribution’
$q(x) = \prod_d \hat{x}_d^{x_d} (1 - \hat{x}_d)^{1 - x_d}$ could be
learned to be 1 for any x and thus not be properly normalized
($\sum_x q(x) \neq 1$).
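The normalization failure can be checked by brute force on a small input space. The sketch below uses a hypothetical `copy_autoencoder` that stands in for a network that has learned to reconstruct its input perfectly, and sums q(x) over all binary vectors of length D = 3.

```python
from itertools import product

def q(x, x_hat):
    """Implied 'distribution' q(x) = prod_d x_hat_d^x_d (1 - x_hat_d)^(1 - x_d)."""
    value = 1.0
    for xd, xh in zip(x, x_hat):
        value *= xh if xd == 1 else (1.0 - xh)
    return value

def copy_autoencoder(x):
    """Hypothetical degenerate autoencoder that reconstructs its input perfectly."""
    return [float(xd) for xd in x]

D = 3
total = sum(q(x, copy_autoencoder(x)) for x in product([0, 1], repeat=D))
# Every one of the 2^D inputs receives q(x) = 1, so the total is 2^D, not 1.
```

The sum grows as 2^D, which is why the reconstruction loss of an unconstrained autoencoder cannot be read as a log-likelihood.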

3. Distribution Estimation as Autoregression 

An interesting question is what property we could impose 
on the autoencoder, such that its output can be used to obtain 
valid probabilities. Specifically, we’d like to be able to write 
p(x) in such a way that it could be computed based on the 
output of a properly corrected autoencoder. 

First, we can use the fact that, for any distribution, the probability
product rule implies that we can always decompose it into the product
of its nested conditionals

$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}), \tag{4}$$

where $x_{<d} = [x_1, \ldots, x_{d-1}]^\top$.

By defining $p(x_d = 1 \mid x_{<d}) = \hat{x}_d$, and thus
$p(x_d = 0 \mid x_{<d}) = 1 - \hat{x}_d$, the loss of Equation 3
becomes a valid negative log-likelihood:

$$\begin{aligned}
-\log p(x) &= \sum_{d=1}^{D} -\log p(x_d \mid x_{<d}) \\
&= \sum_{d=1}^{D} -x_d \log p(x_d = 1 \mid x_{<d}) - (1 - x_d) \log p(x_d = 0 \mid x_{<d}) \\
&= \ell(x).
\end{aligned} \tag{5}$$

This connection provides a way to define autoencoders that can be used
for distribution estimation. Each output
$\hat{x}_d = p(x_d \mid x_{<d})$ must be a function taking as input
$x_{<d}$ only and outputting the probability of observing value $x_d$
at the $d^{\text{th}}$ dimension. In particular, the autoencoder forms
a proper distribution if each output unit $\hat{x}_d$ only depends on
the previous input units $x_{<d}$, and not the other units
$x_{\geq d} = [x_d, \ldots, x_D]^\top$.

We refer to this property as the autoregressive property, because
computing the negative log-likelihood (5) is equivalent to sequentially
predicting (regressing) each dimension of input x.
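The product rule of Equation 4 holds for any joint distribution, which can be verified numerically. The sketch below builds an arbitrary joint over D = 3 binary variables (the random weights are purely illustrative), derives each conditional by marginalization, and multiplies the conditionals back together.

```python
from itertools import product
import random

D = 3
rng = random.Random(0)
states = list(product([0, 1], repeat=D))
weights = [rng.random() + 0.1 for _ in states]
joint = {s: w / sum(weights) for s, w in zip(states, weights)}

def prefix_prob(prefix):
    """Marginal probability that x starts with the given prefix."""
    n = len(prefix)
    return sum(p for s, p in joint.items() if s[:n] == prefix)

def conditional(x, d):
    """p(x_d | x_<d), with d counted from 0 here."""
    return prefix_prob(x[:d + 1]) / prefix_prob(x[:d])

def chain_rule(x):
    """Product of the nested conditionals of Equation 4."""
    result = 1.0
    for d in range(D):
        result *= conditional(x, d)
    return result
```

Calling `chain_rule(s)` for any state `s` recovers `joint[s]` up to floating-point rounding, since the marginals in consecutive factors cancel.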
4. Masked Autoencoders

The question now is how to modify the autoencoder so as to satisfy the
autoregressive property. Since output $\hat{x}_d$ must depend only on
the preceding inputs $x_{<d}$, it means that there must be no
computational path between output unit $\hat{x}_d$ and any of the input
units $x_d, \ldots, x_D$. In other words, for each of these paths, at
least one connection (in matrix W or V) must be 0.

A convenient way of zeroing connections is to elementwise-multiply each
matrix by a binary mask matrix, whose entries that are set to 0
correspond to the connections we wish to remove. For a single hidden
layer autoencoder, we write

$$h(x) = g(b + (W \odot M^W) x) \tag{6}$$

$$\hat{x} = \mathrm{sigm}(c + (V \odot M^V) h(x)) \tag{7}$$

where $M^W$ and $M^V$ are the masks for W and V respectively. It is
thus left to the masks $M^W$ and $M^V$ to satisfy the autoregressive
property.


To impose the autoregressive property we first assign each unit in the
hidden layer an integer m between 1 and D−1 inclusively. The
$k^{\text{th}}$ hidden unit’s number m(k) gives the maximum number of
input units to which it can be connected. We disallow m(k) = D since
this hidden unit would depend on all inputs and could not be used in
modelling any of the conditionals $p(x_d \mid x_{<d})$. Similarly, we
exclude m(k) = 0, as it would create constant hidden units.

The constraints on the maximum number of inputs to each hidden unit are
encoded in the matrix masking the connections between the input and
hidden units:

$$M^{W}_{k,d} = 1_{m(k) \geq d} = \begin{cases} 1 & \text{if } m(k) \geq d \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$

for $d \in \{1, \ldots, D\}$ and $k \in \{1, \ldots, K\}$. Overall, we
need to encode the constraint that the $d^{\text{th}}$ output unit is
only connected to $x_{<d}$ (and thus not to $x_{\geq d}$). Therefore
the output weights can only connect the $d^{\text{th}}$ output to
hidden units with m(k) < d, i.e. units that are connected to at most
d−1 input units. These constraints are encoded in the output mask
matrix:

$$M^{V}_{d,k} = 1_{d > m(k)} = \begin{cases} 1 & \text{if } d > m(k) \\ 0 & \text{otherwise,} \end{cases} \tag{9}$$

again for $d \in \{1, \ldots, D\}$ and $k \in \{1, \ldots, K\}$. Notice
that, from this rule, no hidden units will be connected to the first
output unit $\hat{x}_1$, as desired.

From these mask constructions, we can easily demonstrate that the
corresponding masked autoencoder satisfies the autoregressive property.
First, we note that, since the masks $M^V$ and $M^W$ represent the
network’s connectivity, their matrix product $M^{V,W} = M^V M^W$
represents the connectivity between the input and the output layer.
Specifically, $M^{V,W}_{d',d}$ is the number of network paths between
output unit $\hat{x}_{d'}$ and input unit $x_d$. Thus, to demonstrate
the autoregressive property, we need to show that $M^{V,W}$ is strictly
lower triangular, i.e. $M^{V,W}_{d',d}$ is 0 if $d' \leq d$. By
definition of the matrix product, we have:

$$M^{V,W}_{d',d} = \sum_{k=1}^{K} M^{V}_{d',k} M^{W}_{k,d} = \sum_{k=1}^{K} 1_{d' > m(k)} \, 1_{m(k) \geq d}. \tag{10}$$

If $d' \leq d$, then there are no values for m(k) such that it is both
strictly less than d' and greater or equal to d. Thus $M^{V,W}_{d',d}$
is indeed 0.

Constructing the masks $M^V$ and $M^W$ only requires an assignment of
the m(k) values to each hidden unit. One could imagine trying to assign
an (approximately) equal number of units to each legal value of m(k).
In our experiments, we instead set m(k) by sampling from a uniform
discrete distribution defined on integers from 1 to D−1, independently
for each of the K hidden units.

Previous work on autoregressive neural networks has also found it
advantageous to use direct connections between the input and output
layers (Bengio & Bengio, 2000). In this context, the reconstruction
becomes:

$$\hat{x} = \mathrm{sigm}(c + (V \odot M^V) h(x) + (A \odot M^A) x), \tag{11}$$

where A is the parameter connection matrix and $M^A$ is its mask
matrix. To satisfy the autoregressive property, $M^A$ simply needs to
be a strictly lower triangular matrix, filled otherwise with ones. We
used such direct connections in our experiments as well.
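The single-hidden-layer mask rules (Equations 8 and 9) and the path-counting argument (Equation 10) can be checked numerically. This is a sketch in plain Python rather than the vectorized GPU code mentioned in the paper; the sizes D = 5 and K = 8 and the helper names are assumptions made for the example.

```python
import random

def build_masks(m, D):
    """Masks of Equations 8 and 9 for hidden-unit numbers m(1..K).

    M_W[k][d] = 1 if m(k) >= d, M_V[d][k] = 1 if d > m(k),
    with 1-based d mapped onto 0-based list indices.
    """
    M_W = [[1 if mk >= d else 0 for d in range(1, D + 1)] for mk in m]
    M_V = [[1 if d > mk else 0 for mk in m] for d in range(1, D + 1)]
    return M_W, M_V

def path_counts(M_V, M_W):
    """Matrix product M_V M_W: paths from input d (column) to output d' (row)."""
    D, K = len(M_V), len(M_W)
    return [[sum(M_V[i][k] * M_W[k][j] for k in range(K)) for j in range(D)]
            for i in range(D)]

rng = random.Random(0)
D, K = 5, 8
m = [rng.randint(1, D - 1) for _ in range(K)]   # uniform on {1, ..., D-1}
M_W, M_V = build_masks(m, D)
paths = path_counts(M_V, M_W)
# Autoregressive property: no path from input d to output d' when d' <= d.
is_autoregressive = all(paths[i][j] == 0 for i in range(D) for j in range(i, D))
```

The check holds for any choice of m(k) in {1, ..., D−1}, since a hidden unit cannot satisfy both m(k) < d' and m(k) ≥ d when d' ≤ d.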


4.1. Deep MADE

One advantage of the masked autoencoder framework described in the
previous section is that it naturally generalizes to deep
architectures. Indeed, as we’ll see, by assigning a maximum number of
connected inputs to all units across the deep network, masks can be
similarly constructed so as to satisfy the autoregressive property.

For networks with L > 1 hidden layers, we use superscripts to index the
layers. The first hidden layer matrix (previously W) will be denoted
$W^1$, the second hidden layer matrix will be $W^2$, and so on. The
number of hidden units (previously K) in each hidden layer will be
similarly indexed as $K^l$, where l is the hidden layer index. We will
also generalize the notation for the maximum number of connected inputs
of the $k^{\text{th}}$ unit in the $l^{\text{th}}$ layer to $m^l(k)$.

We’ve already discussed how to define the first layer’s mask matrix
such that it ensures that its $k^{\text{th}}$ unit is connected to at
most m(k) (now $m^1(k)$) inputs. To impose the same property on the
second hidden layer, we must simply make sure that each unit k' is only
connected to first layer units connected to at most $m^2(k')$ inputs,
i.e. the first layer units such that $m^1(k) \leq m^2(k')$.

One can generalize this rule to any layer l, as follows:

$$M^{W^l}_{k',k} = 1_{m^l(k') \geq m^{l-1}(k)} = \begin{cases} 1 & \text{if } m^l(k') \geq m^{l-1}(k) \\ 0 & \text{otherwise.} \end{cases} \tag{12}$$

Also, taking l = 0 to mean the input layer and defining $m^0(d) = d$
(which is intuitive, since the $d^{\text{th}}$ input unit indeed takes
its values from the d first inputs), this definition also applies for
the first hidden layer weights. As for the output mask, we simply need
to adapt its definition by using the connectivity constraints of the
last hidden layer $m^L(k)$ instead of the first:

$$M^{V}_{d,k} = 1_{d > m^L(k)} = \begin{cases} 1 & \text{if } d > m^L(k) \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$

Like for the single hidden layer case, the values for $m^l(k)$ for each
hidden layer $l \in \{1, \ldots, L\}$ are sampled uniformly. To avoid
unconnected units, the value for $m^l(k)$ is sampled to be greater than
or equal to the minimum connectivity at the previous layer, i.e.
$\min_{k'} m^{l-1}(k')$.

Figure 1. Left: Conventional three hidden layer autoencoder. Input in
the bottom is passed through fully connected layers and point-wise
nonlinearities. In the final top layer, a reconstruction specified as a
probability distribution over inputs is produced. As this distribution
depends on the input itself, a standard autoencoder cannot predict or
sample new data. Right: MADE. The network has the same structure as the
autoencoder, but a set of connections is removed such that each input
unit is only predicted from the previous ones, using multiplicative
binary masks ($M^{W^1}$, $M^{W^2}$, $M^V$). In this example, the
ordering of the input is changed from 1,2,3 to 3,1,2. This change is
explained in section 4.2, but is not necessary for understanding the
basic principle. The numbers in the hidden units indicate the maximum
number of inputs on which the $k^{\text{th}}$ unit of layer l depends.
The masks are constructed based on these numbers (see Equations 12 and
13). These masks ensure that MADE satisfies the autoregressive
property, allowing it to form a probabilistic model, in this example
$p(x) = p(x_2)\, p(x_3 \mid x_2)\, p(x_1 \mid x_2, x_3)$. Connections
in light gray correspond to paths that depend only on 1 input, while
the dark gray connections depend on 2 inputs.

4.2. Order-agnostic training

So far, we’ve assumed that the conditionals modelled by MADE were
consistent with the natural ordering of the dimensions of x. However,
we might be interested in modelling the conditionals associated with an
arbitrary ordering of the input’s dimensions.

Specifically, Uria et al. (2014) have shown that training an
autoregressive model on all orderings can be beneficial. We refer to
this approach as order-agnostic training. It can be achieved by
sampling an ordering before each stochastic/minibatch gradient update
of the model. There are two advantages of this approach. Firstly,
missing values in partially observed input vectors can be imputed
efficiently: we invoke an ordering where observed dimensions are all
before unobserved ones, making inference straightforward. Secondly, an
ensemble of autoregressive models can be constructed on the fly, by
exploiting the fact that the conditionals for two different orderings
are not guaranteed to be exactly consistent (and thus technically
correspond to slightly different models). An ensemble is then easily
obtained by sampling a set of orderings, computing the probability of x
under each ordering and averaging.

Conveniently, in MADE, the ordering is simply represented by the vector
$m^0 = [m^0(1), \ldots, m^0(D)]$. Specifically, $m^0(d)$ corresponds to
the position of the original $d^{\text{th}}$ dimension of x in the
product of conditionals. Thus, a random ordering can be obtained by
randomly permuting the ordered vector [1, ..., D]. From these values of
each $m^0$, the first hidden layer mask matrix can then be created.
During order-agnostic training, randomly permuting the last value of
$m^0$ again is sufficient to obtain a new random ordering.

4.3. Connectivity-agnostic training 

One advantage of order-agnostic training is that it effectively 
allows us to train as many models as there are orderings, 
using a common set of parameters. This can be exploited 
by creating ensembles of models at test time. 
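Averaging probabilities across orderings is best done in the log domain, because the per-ordering log-likelihoods are typically very negative. The sketch below is one way to do this; the function name and the three example values are hypothetical.

```python
import math

def ensemble_log_prob(log_probs):
    """Log of the average probability, given one log p(x) per ordering.

    Uses the log-sum-exp trick so that very negative log-probabilities
    do not underflow when exponentiated.
    """
    top = max(log_probs)
    mean_scaled = sum(math.exp(lp - top) for lp in log_probs) / len(log_probs)
    return top + math.log(mean_scaled)

# Hypothetical log-likelihoods of one test vector under three orderings.
log_probs = [-88.4, -87.9, -89.1]
avg = ensemble_log_prob(log_probs)
```

The result always lies between the smallest and largest input, and it is dominated by the orderings under which the test vector is most probable.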

Algorithm 1 Computation of p(x) and learning gradients for MADE with
order and connectivity sampling. D is the size of the input, L the
number of hidden layers and K the number of hidden units.

Input: training observation vector x
Output: p(x) and gradients of $-\log p(x)$ on parameters

# Sampling $m^l$ vectors
$m^0 \leftarrow \mathrm{shuffle}([1, \ldots, D])$
for l from 1 to L do
    for k from 1 to $K^l$ do
        $m^l(k) \leftarrow \mathrm{Uniform}([\min_{k'} m^{l-1}(k'), \ldots, D-1])$
    end for
end for

# Constructing masks for each layer
for l from 1 to L do
    $M^{W^l} \leftarrow 1_{m^l \geq m^{l-1}}$
end for
$M^V \leftarrow 1_{m^0 > m^L}$

# Computing p(x)
$h^0(x) \leftarrow x$
for l from 1 to L do
    $h^l(x) \leftarrow g(b^l + (W^l \odot M^{W^l}) h^{l-1}(x))$
end for
$\hat{x} \leftarrow \mathrm{sigm}(c + (V \odot M^V) h^L(x))$
$p(x) \leftarrow \exp\left(\sum_{d=1}^{D} x_d \log \hat{x}_d + (1 - x_d) \log(1 - \hat{x}_d)\right)$

# Computing gradients of $-\log p(x)$
$\mathrm{tmp} \leftarrow \hat{x} - x$
$\delta c \leftarrow \mathrm{tmp}$
$\delta V \leftarrow (\mathrm{tmp}\, h^L(x)^\top) \odot M^V$
$\mathrm{tmp} \leftarrow (\mathrm{tmp}^\top (V \odot M^V))^\top$
for l from L to 1 do
    $\mathrm{tmp} \leftarrow \mathrm{tmp} \odot g'(b^l + (W^l \odot M^{W^l}) h^{l-1}(x))$
    $\delta b^l \leftarrow \mathrm{tmp}$
    $\delta W^l \leftarrow (\mathrm{tmp}\, h^{l-1}(x)^\top) \odot M^{W^l}$
    $\mathrm{tmp} \leftarrow (\mathrm{tmp}^\top (W^l \odot M^{W^l}))^\top$
end for
return p(x), $\delta b^1, \ldots, \delta b^L$, $\delta W^1, \ldots, \delta W^L$, $\delta c$, $\delta V$

In MADE, in addition to choosing an ordering, we also have to choose
each hidden unit’s connectivity constraint $m^l(k)$. Thus, we could
imagine training MADE to also be agnostic of the connectivity pattern
generated by these constraints. To achieve this, instead of sampling
the values of $m^l(k)$ for all units and layers once and for all before
training, we actually resample them for each training example or
minibatch. This is still practical, since the operation of creating the
masks is easy to parallelize. Denoting
$m^l = [m^l(1), \ldots, m^l(K^l)]$, and assuming an element-wise and
parallel implementation of the operation $1_{a \geq b}$ for vectors,
such that $1_{a \geq b}$ is a matrix whose i,j element is
$1_{a_i \geq b_j}$, then the hidden layer masks are simply
$M^{W^l} = 1_{m^l \geq m^{l-1}}$.
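The sampling and mask-construction steps of Algorithm 1 can be sketched with an outer comparison between the connectivity vectors of consecutive layers. The code below is plain Python for readability rather than a vectorized GPU implementation, it assumes D ≥ 2, and the sizes and helper names are illustrative. It also multiplies the masks together to confirm that each output only has paths from inputs that come earlier in the sampled ordering.

```python
import random

def outer_ge(a, b):
    """Matrix whose (i, j) entry is 1 if a[i] >= b[j], else 0."""
    return [[1 if ai >= bj else 0 for bj in b] for ai in a]

def outer_gt(a, b):
    """Matrix whose (i, j) entry is 1 if a[i] > b[j], else 0."""
    return [[1 if ai > bj else 0 for bj in b] for ai in a]

def sample_masks(D, hidden_sizes, rng):
    """Sample m^0..m^L as in Algorithm 1 (D >= 2) and build the masks."""
    m = [rng.sample(range(1, D + 1), D)]            # m^0: a random ordering
    for K in hidden_sizes:
        low = min(m[-1])                            # minimum connectivity below
        m.append([rng.randint(low, D - 1) for _ in range(K)])
    hidden_masks = [outer_ge(m[l], m[l - 1]) for l in range(1, len(m))]
    output_mask = outer_gt(m[0], m[-1])
    return m, hidden_masks, output_mask

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

rng = random.Random(0)
D = 6
m, hidden_masks, output_mask = sample_masks(D, [10, 10], rng)
paths = hidden_masks[0]
for mask in hidden_masks[1:]:
    paths = matmul(mask, paths)
paths = matmul(output_mask, paths)                  # D x D input-to-output path counts
# Output i may depend on input j only if j comes earlier in the ordering m^0.
ok = all(paths[i][j] == 0 for i in range(D) for j in range(D) if m[0][i] <= m[0][j])
```

Resampling for a new update amounts to calling `sample_masks` again, which is cheap compared with the forward and backward passes.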

By resampling the connectivity of hidden units for every update, each
hidden unit will have a constantly changing number of incoming inputs
during training. However, the absence of a connection is
indistinguishable from an instantiated connection to a zero-valued
unit, which could confuse the neural network during training. In a
similar situation, Uria et al. (2014) informed each hidden unit which
units were providing input with binary indicator variables, connected
with additional learnable weights. We considered applying a similar
strategy, using companion weight matrices $U^l$, that are also masked
by $M^{W^l}$ but connected to a constant one-valued vector:

$$h^l(x) = g(b^l + (W^l \odot M^{W^l}) h^{l-1}(x) + (U^l \odot M^{W^l}) \mathbf{1}) \tag{14}$$

An analogous parametrization of the output layer was also employed.
These connectivity conditioning weights were only sometimes useful. In
our experiments, we treated the choice of using them as a
hyperparameter.
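Equation 14 can be sketched for a single layer as follows. Applying the masked companion matrix to a vector of ones reduces to a row sum, so each unit receives an extra learned offset that depends on which of its connections are currently active. The weights, mask and identity nonlinearity below are hypothetical values chosen so the arithmetic is easy to follow.

```python
def masked_layer(h_prev, W, U, M, b, g):
    """Equation 14: g(b + (W * M) h_prev + (U * M) 1), with * elementwise.

    (U * M) applied to a vector of ones is simply the row sum of U * M,
    so each unit receives a learned bias that depends on its current mask.
    """
    out = []
    for Wk, Uk, Mk, bk in zip(W, U, M, b):
        pre = bk
        pre += sum(w * mk * h for w, mk, h in zip(Wk, Mk, h_prev))
        pre += sum(u * mk for u, mk in zip(Uk, Mk))
        out.append(g(pre))
    return out

# Hypothetical 2-unit layer over 3 inputs, with an identity nonlinearity.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
U = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
M = [[1, 0, 0], [1, 1, 0]]
b = [0.0, 0.0]
h = masked_layer([1.0, 0.0, 1.0], W, U, M, b, g=lambda a: a)
```

In this example the second unit is connected to an input whose value is 0: the `W` term contributes nothing for that input, but the `U` term still does, which is how the unit can tell a zero-valued input apart from a masked connection.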

Moreover, we’ve found in our experiments that sampling 
masks for every example could sometimes over-regularize 
MADE and provoke underfitting. To fix this issue, we also 
considered sampling from only a finite list of masks. During 
training, MADE cycles through this list, using one for every 
update. At test time, we then average probabilities obtained 
for all masks in the list. 
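One way to implement a finite list of masks is to store one random seed per mask and regenerate the connectivity vectors from the seed whenever that mask is needed. This is an assumption about a possible implementation, not a description of the released code; `make_mask_seeds` and `sample_m` are hypothetical helpers, and only a single hidden layer is shown.

```python
import random
from itertools import cycle, islice

def make_mask_seeds(num_masks, base_seed=0):
    """One seed per mask; re-seeding a generator reproduces the same mask."""
    return [base_seed + i for i in range(num_masks)]

def sample_m(seed, D, K):
    """Hypothetical helper: ordering m^0 and hidden numbers m^1 for one mask."""
    rng = random.Random(seed)
    m0 = rng.sample(range(1, D + 1), D)
    m1 = [rng.randint(1, D - 1) for _ in range(K)]
    return m0, m1

seeds = make_mask_seeds(num_masks=4)
# Training: one mask per update, cycling through the fixed list.
schedule = list(islice(cycle(seeds), 10))
# Test time: revisit every mask in the list and average the probabilities.
test_masks = [sample_m(s, D=5, K=8) for s in seeds]
```

Storing seeds rather than full mask matrices keeps memory use independent of the number of masks, at the cost of rebuilding each mask when it is revisited.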

Algorithm 1 details how p(x) is computed by MADE, as well as how to
obtain the gradient of $\ell(x)$ for stochastic gradient descent
training. For simplicity, the pseudocode assumes order-agnostic and
connectivity-agnostic training, and doesn’t assume the use of
conditioning weight matrices or of direct input/output connections.
Figure 1 also illustrates an example of such a two-layer MADE network,
along with its $m^l(k)$ values and its masks.
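A useful sanity check on any implementation of the forward pass is that the resulting p(x) sums to 1 over all inputs. The sketch below does this by brute force for a single hidden layer with the natural ordering, D = 4 inputs and random weights; these choices, the ReLU nonlinearity and the function name are assumptions. It multiplies the conditionals directly instead of exponentiating a sum of logs, which is equivalent for this purpose.

```python
import math
import random
from itertools import product

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def made_prob(x, W, b, V, c, m):
    """p(x) from a single-hidden-layer MADE with natural input ordering."""
    D, K = len(x), len(m)
    # Hidden unit k sees input d (1-based d + 1) only if m(k) >= d + 1.
    h = [max(0.0, b[k] + sum(W[k][d] * x[d] for d in range(D) if m[k] >= d + 1))
         for k in range(K)]
    p = 1.0
    for d in range(D):
        # Output d + 1 sees hidden unit k only if d + 1 > m(k).
        a = c[d] + sum(V[d][k] * h[k] for k in range(K) if d + 1 > m[k])
        x_hat = sigm(a)
        p *= x_hat if x[d] == 1 else (1.0 - x_hat)
    return p

rng = random.Random(0)
D, K = 4, 6
m = [rng.randint(1, D - 1) for _ in range(K)]
W = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
V = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(D)]
b = [rng.gauss(0, 1) for _ in range(K)]
c = [rng.gauss(0, 1) for _ in range(D)]
total = sum(made_prob(x, W, b, V, c, m) for x in product([0, 1], repeat=D))
# total is 1 up to floating-point rounding: the masked outputs form a proper distribution.
```

Removing either mask condition breaks the autoregressive property, and the total is then no longer guaranteed to equal 1.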

5. Related Work 

There has been a lot of recent work on exploring the use of
feed-forward, autoencoder-like neural networks as probabilistic
generative models. Part of the motivation behind this research is to
test the common assumption that the use of models with probabilistic
latent variables and intractable partition functions (such as the
restricted Boltzmann machine (Salakhutdinov & Murray, 2008)), is a
necessary evil in designing powerful generative models for
high-dimensional data.

The work on the neural autoregressive distribution estimator or NADE
(Larochelle & Murray, 2011) has illustrated that feed-forward
architectures can in fact be used to form state-of-the-art and even
tractable distribution estimators.

Recently, a deep extension of NADE was proposed, improving even further
the state-of-the-art in distribution estimation (Uria et al., 2014).
This work introduced a randomized training procedure, which (like MADE)
has nearly the same cost per iteration as a standard autoencoder.
Unfortunately, deep NADE models still require D feed-forward passes
through the network to evaluate the probability p(x) of a D-dimensional
test vector. The computation of the first hidden layer’s activations
can be shared across these passes, although this is slower in practice
than evaluating a single pass in a standard autoencoder. In deep
networks with K hidden units per layer, it costs $O(DK^2)$ to evaluate
a test vector.

Table 1. Complexity of the different models in Table 6, to compute an
exact test negative log-likelihood. R is the number of orderings used,
D is the input size, and K is the hidden layer size (assuming equally
sized hidden layers).

Model                  O_NLL
RBM 25 CD steps        O(min(2^D K, D 2^K))
DARN                   O(2^K D)
NADE (fixed order)     O(DK)
EoNADE 1hl, R ord.     O(RDK)
EoNADE 2hl, R ord.     O(RDK^2)
MADE 1hl, 1 ord.       O(DK + D^2)
MADE 2hl, 1 ord.       O(DK + K^2 + D^2)
MADE 1hl, R ord.       O(R(DK + D^2))
MADE 2hl, R ord.       O(R(DK + K^2 + D^2))

Deep AutoRegressive Networks (DARN, Gregor et al., 2014), also provide
probabilistic models with roughly the same training costs as standard
autoencoders. DARN’s latent representation consists of binary,
stochastic hidden units. While simulating from these models is fast,
evaluation of exact test probabilities requires summing over all
configurations of the latent representation, which is exponential in
computation. Monte Carlo approximation is thus recommended.

The main advantage of MADE is that evaluating probabilities retains the
efficiency of autoencoders, with minor additional cost for simple
masking operations. Table 1 lists the computational complexity for
exact computation of probabilities for various models. DARN and RBMs
are exponential in dimensionality of the hiddens or data, whereas NADE
and MADE are polynomial. MADE only requires one pass through the
autoencoder rather than the D passes required by NADE. In practice, we
also observe that the single-layer MADE is an order of magnitude faster
than a one-layer NADE, for the same hidden layer size, despite NADE
sharing computation to get the same asymptotic scaling. NADE’s
computations cannot be vectorized as efficiently. The deep versions of
MADE also have better scaling than NADE at test time. The training
costs for MADE, DARN, and deep NADE will all be similar.
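To give a feel for the gap between the polynomial entries of Table 1, the leading-order terms can be evaluated for a concrete setting. The sketch below treats the big-O expressions as raw operation counts, which ignores constant factors and hardware effects, and the setting (784 inputs, 500 hidden units, one ordering) is chosen only for illustration.

```python
def made_cost(D, K, hidden_layers, R=1):
    """Leading-order operation count from Table 1 for MADE (1 or 2 hidden layers)."""
    per_pass = D * K + D * D
    if hidden_layers == 2:
        per_pass += K * K
    return R * per_pass

def eonade_cost(D, K, hidden_layers, R=1):
    """Leading-order operation count from Table 1 for EoNADE."""
    return R * D * K if hidden_layers == 1 else R * D * K * K

# Hypothetical setting: 784 inputs (28x28 pixels), 500 hidden units, one ordering.
D, K = 784, 500
made_2hl = made_cost(D, K, hidden_layers=2)      # DK + K^2 + D^2
eonade_2hl = eonade_cost(D, K, hidden_layers=2)  # D K^2
ratio = eonade_2hl / made_2hl
```

Under these assumptions the two-hidden-layer counts are 1,256,656 for MADE and 196,000,000 for EoNADE, a ratio of more than 100. Measured speed-ups depend on the implementation and should not be read off from this arithmetic.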

Before the work on NADE, Bengio & Bengio (2000) proposed a neural
network architecture that corresponds to the special case of a single
hidden layer MADE model, without randomization of input ordering and
connectivity. A contribution of our work is to go beyond this special
case, exploring deep variants and order/connectivity-agnostic training.


Table 2. Number of input dimensions and numbers of examples in the
train, validation, and test splits.

Name          # Inputs   Train    Valid.   Test
Adult         123        5000     1414     26147
Connect4      126        16000    4000     47557
DNA           180        1400     600      1186
Mushrooms     112        2000     500      5624
NIPS-0-12     500        400      100      1240
OCR-letters   128        32152    10000    10000
RCV1          150        40000    10000    150000
Web           300        14000    3188     32561


An interesting interpretation of the autoregressive mask sampling is as
a structured form of dropout regularization (Srivastava et al., 2014).
Specifically, it bears similarity with the masking in dropconnect
networks (Wan et al., 2013). The exception is that the masks generated
here must guarantee the autoregressive property of the autoencoder,
while in Wan et al. (2013), each element in the mask is generated
independently.

6. Experiments 

To test the performance of our model we considered 
two different benchmarks: a suite of UCI binary 
datasets, and the binarized MNIST dataset. The code to 
reproduce the experiments of this paper is available at 
https://github.com/mgermain/MADE/releases/tag/lCML2015 
The results reported here are the average negative log-likelihood on
the test set of each respective dataset. All experiments were made
using stochastic gradient descent (SGD) with mini-batches of size 100
and a lookahead of 30 for early stopping.

6.1. UCI evaluation suite 

We use the binary UCI evaluation suite that was first put together in
Larochelle & Murray (2011). It’s a collection of 7 relatively small
datasets from the University of California, Irvine machine learning
repository and the OCR-letters dataset from the Stanford AI Lab.
Table 2 gives an overview of the scale of those datasets and the way
they were split.

The experiments were run with networks of 500 units per hidden layer,
using the adadelta learning update (Zeiler, 2012) with a decay of 0.95.
The other hyperparameters were varied as Table 3 indicates. We note as
# of masks the number of different masks through which MADE cycles
during training. In the no limit case, masks are sampled on the fly and
never explicitly reused unless re-sampled by chance. In this situation,
at validation and test time, 300 and 1000 sampled masks respectively
are used for averaging probabilities.
Table 3. UCI Grid Search

Hyperparameter          Values tried
# Hidden Layer          1, 2
Activation function     ReLU, Softplus
Adadelta epsilon        10^-5, 10^-7, 10^-9
Conditioning Weights    True, False
# of orderings          1, 8, 16, 32, No Limit

Table 5. Binarized MNIST Grid Search

Hyperparameter          Values tried
# Hidden Layer          1, 2
Learning Rate           0.1, 0.05, 0.01, 0.005
# of masks              1, 2, 4, 8, 16, 32, 64




The results are reported in Table 4. We see that MADE is among the best
performing models on half of the datasets and is competitive otherwise.
To reduce clutter, we have not reported standard deviations, which were
fairly small and consistent across models. However, for completeness we
report standard deviations in a separate table in the supplementary
materials.

An analysis of the hyperparameters selected for each dataset reveals no
clear winner. However, we do see from Table 4 that when the mask
sampling helps, it helps quite a bit and when it does not, the impact
is negligible on all but OCR-letters. Another interesting note is that
the conditioning weights had almost no influence except on NIPS-0-12
where they helped.

6.2. Binarized MNIST evaluation 

The version of MNIST we used is the one binarized by Salakhutdinov &
Murray (2008). MNIST is a set of 70,000 handwritten digits of 28x28
pixels. We use the same split as in Larochelle & Murray (2011),
consisting of 50,000 for the training set, 10,000 for the validation
set and 10,000 for the test set.

Experiments were run using the adagrad learning update (Duchi et al.,
2010), with an epsilon of $10^{-6}$. Since MADE is much more efficient
than NADE, we considered varying the hidden layer size from 500 to 8000
units. Seeing that increasing the number of units tended to always
help, we used 8000. Even with such a large hidden layer, our GPU
implementation of MADE was quite efficient. Using a single mask, one
training epoch requires about 14 and 44 seconds, for one hidden layer
and two hidden layer MADE respectively. Using 32 sampled masks,
training time increases to 33 and 100 seconds respectively. These
timings are all less than our GPU implementation of the 500 hidden
units NADE model, which requires about 130 seconds per epoch. These
timings were obtained on a K20 NVIDIA GPU.

Building on what we learned on the UCI experiments, we set the
activation function to be ReLU and the conditioning weights were not
used. The hyperparameters that were varied are in Table 5.

The results are reported in Table 6, alongside other results taken from
the literature. Again, despite its tractability, MADE is competitive
with other models. Of note is the fact that the best MADE model
outperforms the single layer NADE network, which was otherwise the best
model among those requiring only a single feed-forward pass to compute
log probabilities.

Figure 2. Impact of the number of masks used with a single hidden
layer, 500 hidden units network, on binarized MNIST.

In these experiments, we clearly observed the over-regularization
phenomenon from using too many masks.
When more than four orderings were used, the deeper vari¬ 
ant of MADE always yielded better results. For the two 
layer model, adding masks during training helped up to 64, 
at which point the negative log-likelihood started to increase. 
We observed a similar pattern for the single layer model, but 
in this case the dip was around 8 masks. Figure 2 illustrates 
this behaviour more precisely for a single layer MADE with 
500 hidden units, trained by only varying the number of 
masks used and the size of the mini-batches (83, 100, 128). 

We randomly sampled 100 digits from our best performing 
model from Table 6 and compared them with their nearest 
neighbor in the training set (Figure 3), to ensure that the 
generated samples are not simple memorization. Each row 
of digits uses a different mask that was seen at training time 
by the network. 
Table 4. Negative log-likelihood test results of different models on
multiple datasets. The best result as well as any other result with an
overlapping confidence interval is shown in bold. Note that since the
variance of DARN was not available, we considered it to be zero.

Model                  Adult   Connect4  DNA     Mushrooms  NIPS-0-12  OCR-letters  RCV1     Web
MoBernoullis           20.44   23.41     98.19   14.46      290.02     40.56        47.59    30.16
RBM                    16.26   22.66     96.74   15.15      277.37     43.05        48.88    29.38
FVSBN                  13.17   12.39     83.64   10.27      276.88     39.30        49.84    29.35
NADE (fixed order)     13.19   11.99     84.81   9.81       273.08     27.22        46.66    28.39
EoNADE 1hl (16 ord.)   13.19   12.58     82.31   9.69       272.39     27.32        46.12    27.87
DARN                   13.19   11.91     81.04   9.55       274.68     ≈28.17       ≈46.10   ≈28.83
MADE                   13.12   11.90     83.63   9.68       280.25     28.34        47.10    28.53
MADE mask sampling     13.13   11.90     79.66   9.69       277.28     30.04        46.74    28.25




Figure 3. Left: Samples from a 2 hidden layer MADE. Right: Nearest neighbour in binarized MNIST. 


Table 6. Negative log-likelihood test results of different models on
the binarized MNIST dataset.

Model                          −log p
RBM (500 h, 25 CD steps)       ≈ 86.34
DBM 2hl                        ≈ 84.62
DBN 2hl                        ≈ 84.55
DARN n_h = 500                 ≈ 84.71
DARN n_h = 500, adaNoise       ≈ 84.13
MoBernoullis K=10              168.95
MoBernoullis K=500             137.64
NADE 1hl (fixed order)         88.33
EoNADE 1hl (128 orderings)     87.71
EoNADE 2hl (128 orderings)     85.10
MADE 1hl (1 mask)              88.40
MADE 2hl (1 mask)              89.59
MADE 1hl (32 masks)            88.04
MADE 2hl (32 masks)            86.64



7. Conclusion 

We proposed MADE, a simple modification of autoencoders allowing them
to be used as distribution estimators. MADE demonstrates that it is
possible to get direct, cheap estimates of high-dimensional joint
probabilities, from a single pass through an autoencoder. Like standard
autoencoders, our extension is easy to vectorize and implement on GPUs.
MADE can evaluate high-dimensional probability distributions with
better scaling than before, while maintaining state-of-the-art
statistical performance.
Table 7. Negative log-likelihood and 95% confidence intervals for
Table 4 in the main document.

Dataset       MADE, fixed mask       MADE, mask sampling    EoNADE 16 ord.
Adult         13.12 ±0.05            13.13 ±[illegible]     13.19 ±0.04
Connect4      11.90 ±0.01            11.90 ±0.01            12.58 ±0.01
DNA           83.63 ±0.52            79.66 ±0.63            82.31 ±0.46
Mushrooms     9.68 ±0.04             9.69 ±0.03             9.69 ±0.03
NIPS-0-12     280.25 ±1.05           275.92 ±[illegible]    272.39 ±[illegible]
OCR-letters   28.34 ±0.22            30.04 ±0.21            27.32 ±0.19
RCV1          47.10 ±[illegible]     46.14 ±[illegible]     46.12 ±[illegible]
Web           28.53 ±0.20            28.25 ±0.20            27.87 ±0.20


Table 8. Binarized MNIST negative log-likelihood and 95% confidence
intervals for Table 6 in the main document.

Model                   −log p
MADE 1hl (1 mask)       88.40 ±[illegible]
MADE 2hl (1 mask)       89.59 ±[illegible]
MADE 1hl (32 masks)     88.04 ±[illegible]
MADE 2hl (32 masks)     86.64 ±[illegible]




